In this post we will use a standard computer vision dataset, the Dogs vs. Cats dataset, which involves classifying photos as containing either a dog or a cat.
Although the dataset seems fairly simple, the goal is to outline the steps required to build an image classification pipeline in PyTorch; the same pipeline can later be applied to any image classification problem at hand.
Dataset
The Dogs vs. Cats dataset was used for a Kaggle machine learning competition held in 2013. It comprises photos of dogs and cats.
Data Loading and Preprocessing
Let’s first call upon all the heavenly gods of Python (that is, import the necessary libraries).
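Something along these lines (a minimal set of imports; the exact list depends on your setup):

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split

import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder

import matplotlib.pyplot as plt
```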
Read Dataset
The above directory structure (one folder per class) is used by many computer vision datasets, and most deep learning libraries provide utilities for working with such datasets. We can use the ImageFolder class from torchvision to load the data as PyTorch tensors.
Dataset class: torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit from Dataset and override the following methods (a minimal sketch follows the list):
- __len__ so that len(dataset) returns the size of the dataset.
- __getitem__ to support indexing, so that dataset[i] can be used to get the i-th sample.
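Here is a minimal sketch of such a class for the one-folder-per-class layout used by this dataset (the class name and file handling here are assumptions, not the competition's official code):

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class DogsVsCatsDataset(Dataset):
    """Hypothetical custom dataset: one folder per class (e.g. train/cat, train/dog)."""

    def __init__(self, root_dir, transform=None):
        self.transform = transform
        self.classes = sorted(os.listdir(root_dir))
        self.samples = []                      # list of (image_path, label) pairs
        for label, cls in enumerate(self.classes):
            cls_dir = os.path.join(root_dir, cls)
            for fname in os.listdir(cls_dir):
                self.samples.append((os.path.join(cls_dir, fname), label))

    def __len__(self):
        # len(dataset) returns the number of samples
        return len(self.samples)

    def __getitem__(self, idx):
        # dataset[idx] returns the idx-th (image, label) pair
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, label
```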
PyTorch datasets allow us to specify one or more transformation functions which are applied to the images as they are loaded. torchvision.transforms contains many such predefined functions; we’ll use Resize to resize the images and ToTensor to convert them into PyTorch tensors.
Internally, the list of classes is stored in the .classes property of the dataset, and the numeric label for each element is the index of that element’s class in this list.
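Putting the above together, loading the images with ImageFolder might look roughly like this (the data directory path and the 100x100 target size are assumptions):

```python
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder

# Hypothetical directory; the actual path depends on where you extracted the data.
data_dir = "data/train"

# Resize every image to 100x100 and convert it to a PyTorch tensor.
transform = transforms.Compose([
    transforms.Resize((100, 100)),
    transforms.ToTensor(),
])

dataset = ImageFolder(data_dir, transform=transform)
print(dataset.classes)        # e.g. ['cats', 'dogs'], mapped to labels 0 and 1
img, label = dataset[0]
print(img.shape, label)       # torch.Size([3, 100, 100]) 0
```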
We can view an image using matplotlib, but we first need to permute the tensor dimensions to (100, 100, 3), since matplotlib expects the channel dimension in the third position. Let’s create a helper function to display an image and its label.
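A possible helper (a sketch; it assumes the dataset object created above):

```python
import matplotlib.pyplot as plt

def show_example(img, label):
    """Display an image tensor and its class name."""
    # The tensor is (channels, height, width); matplotlib expects (height, width, channels).
    plt.imshow(img.permute(1, 2, 0))
    plt.title("Label: " + dataset.classes[label])
    plt.show()

show_example(*dataset[0])
```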

Training and Validation Datasets
While building real-world machine learning models, it is quite common to split the dataset into 3 parts:
- Training set — used to train the model i.e. compute the loss and adjust the weights of the model using gradient descent.
- Validation set — used to evaluate the model while training, adjust hyperparameters (learning rate etc.) and pick the best version of the model.
- Test set — used to compare different models, or different types of modeling approaches, and report the final accuracy of the model.

Since there’s no predefined validation set, we can set aside a small portion of the training set to be used as the validation set. We’ll use the random_split helper method from PyTorch to do this, as shown below. To ensure that we always create the same validation set, we’ll also set a seed for the random number generator.
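A sketch of the split (the validation size of 2,000 images and the seed value are assumptions):

```python
import torch
from torch.utils.data import random_split

val_size = 2000                          # assumed size of the validation set
train_size = len(dataset) - val_size

torch.manual_seed(43)                    # fixed seed so the split is reproducible
train_ds, val_ds = random_split(dataset, [train_size, val_size])
print(len(train_ds), len(val_ds))
```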
DataLoader
We can now create data loaders to help us load the data in batches. Loading a large dataset into memory all at once can exhaust memory and slow the program down. PyTorch offers a solution that parallelizes data loading and supports automatic batching: the DataLoader class in the torch.utils.data package.
We’ll use a batch size of 64: we load and train on 64 samples at a time until all the images in the training set have been processed, which completes 1 epoch.
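For example (the num_workers and pin_memory values are assumptions you may want to tune):

```python
from torch.utils.data import DataLoader

batch_size = 64

# Shuffle the training data every epoch; use a larger batch for validation since
# no gradients are computed there.
train_dl = DataLoader(train_ds, batch_size, shuffle=True, num_workers=2, pin_memory=True)
val_dl = DataLoader(val_ds, batch_size * 2, num_workers=2, pin_memory=True)
```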
What is batch size?
The number of samples (data points) that would be passed through the network at a time.
What is epoch?
An epoch is one single pass of all the input data through the network.
Relation between batch_size and epoch?
batch_size is not the same as epoch. Suppose you have 1000 images: processing all 1000 images through the network once is 1 epoch. If we set the batch size to 10, during training we pass 10 data points through the network at a time, so it takes 100 iterations (= 1000/10) to pass in all the training data and complete 1 epoch.
Generally, the larger the batch size, the faster the training, provided your hardware can handle it. However, even when the machine can handle the heavier computation, a larger batch size can sometimes degrade model quality and make it harder for the model to generalize.
Evaluation Metric and Loss Function
Let’s first define our evaluation metric: we need a way to evaluate how well our model is performing. A natural choice is the percentage of labels that were predicted correctly, i.e. the accuracy of the predictions.
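One way to write such an accuracy helper (a sketch, not necessarily the exact function from the original notebook):

```python
import torch

def accuracy(outputs, labels):
    """Fraction of predictions that match the true labels."""
    _, preds = torch.max(outputs, dim=1)   # index of the largest score per row = predicted class
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))
```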
Here we are using the torch.max() function. By default, as the name suggests, it returns the maximum among all the elements of the tensor. However, it can also return the maximum along a particular dimension, as a tensor, instead of a single element. To specify the dimension (the axis, in NumPy terms), there is an optional keyword argument called dim, which gives the direction along which to take the maximum (see the example after the list below).
max_elements, max_indices = torch.max(input_tensor, dim)
- dim=0: maximum along columns.
- dim=1: maximum along rows.
This returns a tuple, max_elements and max_indices.
- max_elements -> All the maximum elements of the Tensor.
- max_indices -> Indices corresponding to the maximum elements.
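For example:

```python
import torch

t = torch.tensor([[0.1, 0.9],
                  [0.7, 0.3]])

values, indices = torch.max(t, dim=1)
print(values)   # tensor([0.9000, 0.7000]) -- row-wise maxima
print(indices)  # tensor([1, 0])           -- their column indices
```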
In the above accuracy function, == performs an element-wise comparison of two tensors with the same shape and returns a boolean tensor of the same shape, with False for unequal elements and True for equal elements (which behave as 0s and 1s). Passing the result to torch.sum returns the number of labels that were predicted correctly. Finally, we divide by the total number of images to get the accuracy.
Loss Function
While the accuracy is a great way for us (humans) to evaluate the model, it can’t be used as a loss function for optimizing our model using gradient descent, for the following reasons:
- It’s not a differentiable function. torch.max and == are both non-continuous and non-differentiable operations, so we can’t use the accuracy for computing gradients w.r.t the weights and biases.
- It doesn’t take into account the actual probabilities predicted by the model, so it can’t provide sufficient feedback for incremental improvements.
Due to these reasons, accuracy is a great evaluation metric for classification, but not a good loss function. A commonly used loss function for classification problems is the cross entropy.
How Cross Entropy works
- For each output row, pick the predicted probability for the correct label. E.g. if the predicted probabilities for an image are [0.1, 0.3, 0.2, …] and the correct label is 1, we pick the corresponding element 0.3 and ignore the rest.
- Then, take the logarithm of the picked probability. If the probability is high, i.e. close to 1, its logarithm is a negative value close to 0; if the probability is low (close to 0), its logarithm is a very large negative value. We also multiply the result by -1, which results in a large positive loss for poor predictions.
- Finally, take the average of the cross entropy across all the output rows to get the overall loss for a batch of data.
Unlike accuracy, cross-entropy is a continuous and differentiable function that also provides good feedback for incremental improvements in the model (a slightly higher probability for the correct label leads to a lower loss). This makes it a good choice for the loss function.
PyTorch provides an efficient and tensor-friendly implementation of cross entropy as part of the torch.nn.functional package. Moreover, it also performs softmax internally, so we can directly pass in the outputs of the model without converting them into probabilities.
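A small usage sketch of F.cross_entropy with made-up logits and labels:

```python
import torch
import torch.nn.functional as F

# Raw, unnormalized scores (logits) for a batch of 2 images over 2 classes,
# and the corresponding correct labels.
outputs = torch.tensor([[2.0, 1.0],
                        [0.5, 3.0]])
labels = torch.tensor([0, 1])

# F.cross_entropy applies softmax internally, then computes the negative
# log-likelihood of the correct class, averaged over the batch.
loss = F.cross_entropy(outputs, labels)
print(loss)    # tensor(0.1961)
```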
Define Model
First, let’s define a simple CNN (Convolutional Neural Network).
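A sketch of such a network (the layer sizes and the class name CatDogCnn are assumptions; it expects the 3x100x100 tensors produced by the Resize transform above):

```python
import torch.nn as nn

class CatDogCnn(nn.Module):
    """A minimal CNN sketch for 100x100 RGB inputs."""

    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 16 x 50 x 50
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 32 x 25 x 25
            nn.Flatten(),
            nn.Linear(32 * 25 * 25, 128),
            nn.ReLU(),
            nn.Linear(128, 2),               # two output classes: cat, dog
        )

    def forward(self, x):
        return self.network(x)

model = CatDogCnn()
```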
We can view a summary of the model using torchsummary, which you can install with pip install torchsummary.
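For example (a usage sketch):

```python
from torchsummary import summary

# Prints a Keras-style layer-by-layer summary for a 3x100x100 input;
# pass device="cuda" instead if the model already lives on the GPU.
summary(model, (3, 100, 100), device="cpu")
```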
Using a GPU
As the sizes of our models and datasets increase, we need to use GPUs to train our models within a reasonable amount of time. GPUs contain hundreds of cores that are optimized for performing expensive matrix operations on floating point numbers in a short time, which makes them ideal for training deep neural networks with many layers.
We can now wrap our data loaders using DeviceDataLoader.
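This step relies on the get_default_device, to_device, and DeviceDataLoader helpers; one common way to define them (a sketch, not necessarily the author's exact code) is:

```python
import torch

def get_default_device():
    """Pick the GPU if available, else the CPU."""
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def to_device(data, device):
    """Move tensor(s) to the chosen device, recursing into lists/tuples."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader:
    """Wrap a DataLoader so every batch is moved to the device as it is yielded."""

    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)

device = get_default_device()
train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)
```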
Training the Model
Before we train the model, we need to ensure that the data and the model’s parameters (weights and biases) are on the same device (CPU or GPU). We can reuse the to_device function to move the model’s parameters to the right device.
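For example, reusing the hypothetical helpers and model class from the sketches above:

```python
# Move the model's parameters to the same device as the data.
model = to_device(CatDogCnn(), device)
```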
Fit and Evaluate the model
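A minimal fit/evaluate loop in the spirit of the description above (the epoch count, learning rate, and optimizer are assumptions, and accuracy is the helper sketched earlier):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, val_dl):
    """Average cross-entropy loss and accuracy over the validation set."""
    model.eval()
    losses, accs = [], []
    for images, labels in val_dl:
        out = model(images)
        losses.append(F.cross_entropy(out, labels).item())
        accs.append(accuracy(out, labels).item())
    return sum(losses) / len(losses), sum(accs) / len(accs)

def fit(epochs, lr, model, train_dl, val_dl, opt_func=torch.optim.Adam):
    """Basic training loop: forward pass, loss, backward pass, optimizer step."""
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_dl:
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        val_loss, val_acc = evaluate(model, val_dl)
        print(f"Epoch {epoch + 1}: val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")

fit(10, 0.001, model, train_dl, val_dl)
```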
We see that the validation loss is higher than the training loss; the model is overfitting.
Regularization
Let’s try to improve the model’s performance by adding regularization techniques like dropout and batch normalization.
The data consists of color images with 3 channels (RGB) and varying sizes, so we resize every image to the same shape (50, 50) and then normalize the inputs using a transform.
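A sketch of such a transform pipeline (the normalization statistics here are placeholders; in practice you would use the per-channel mean and standard deviation of your training set):

```python
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder

# Placeholder per-channel mean and std for normalization.
stats = ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

train_tfms = transforms.Compose([
    transforms.Resize((50, 50)),
    transforms.ToTensor(),
    transforms.Normalize(*stats),
])

dataset = ImageFolder(data_dir, transform=train_tfms)
```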
Model with Regularisation
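A sketch of the regularized network (layer sizes assume the 3x50x50 inputs from the transform above; the dropout rate is an assumption):

```python
import torch.nn as nn

class CatDogCnnReg(nn.Module):
    """The earlier CNN sketch with batch norm and dropout added."""

    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),               # normalizes activations per channel
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 16 x 25 x 25
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 32 x 12 x 12
            nn.Flatten(),
            nn.Dropout(0.5),                  # randomly zeroes features during training
            nn.Linear(32 * 12 * 12, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, x):
        return self.network(x)
```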
Model Training and Evaluation
We see an improvement in model performance compared to the previous architecture, with a bump in accuracy from 79% to 83%.